This report explores a dataset of 1,599 red wines with 13 variables: 11 chemical properties of the wine, the quality rating given by experts, and the identifier variable X.
## [1] 1599 13
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
Our dataset consists of 13 variables and 1,599 observations. The first variable, ‘X’, is the identifier of each red wine, and the last variable, quality, is the quality rating given by experts. All other variables are chemical attributes of the red wine.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
The first thing that I wanted to explore is the distribution of the quality of red wines. According to the plot above, quality is approximately normally distributed: most wines are rated 5 or 6, and the counts fall off quickly toward 3 and 8.
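As a quick check on this description, the quality scores can be rebuilt from the count table printed above and summarized again. This is only a minimal sketch in base R: the names `quality_counts`, `quality` and `share_5_or_6` are illustrative, and the counts are typed in by hand from the table rather than read from the original data file.

```r
# Rebuild the quality scores from the printed count table
# (ratings 3..8 with 10, 53, 681, 638, 199 and 18 wines respectively).
quality_counts <- c(10, 53, 681, 638, 199, 18)
quality <- rep(3:8, times = quality_counts)

# Summary statistics; these should agree with the summary shown above.
print(summary(quality))

# Share of wines at each rating, as percentages rounded to one decimal.
quality_share <- round(100 * prop.table(table(quality)), 1)
print(quality_share)

# Proportion of wines rated 5 or 6.
share_5_or_6 <- mean(quality %in% c(5, 6))
print(share_5_or_6)
```

Because quality is an integer score, "approximately normal" here can only mean single-peaked and roughly symmetric, not continuous.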
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
##
## 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11 0.12 0.13 0.14
## 132 33 50 30 29 20 24 22 33 30 35 15 27 18 21
## 0.15 0.16 0.17 0.18 0.19 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29
## 19 9 16 22 21 25 33 27 25 51 27 38 20 19 21
## 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.4 0.41 0.42 0.43 0.44
## 30 30 32 25 24 13 20 19 14 28 29 16 29 15 23
## 0.45 0.46 0.47 0.48 0.49 0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59
## 22 19 18 23 68 20 13 17 14 13 12 8 9 9 8
## 0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69 0.7 0.71 0.72 0.73 0.74
## 9 2 1 10 9 7 14 2 11 4 2 1 1 3 4
## 0.75 0.76 0.78 0.79 1
## 1 3 1 1 1
Above I plotted three acid-related variables: fixed acidity, volatile acidity and citric acid. The distributions of fixed acidity and volatile acidity are slightly right-skewed. Citric acid is distributed differently from the other two: it has two obvious peaks, one at 0 (132 wines) and another just below 0.5 (68 wines at 0.49).
Next, since density depends on the percentage of alcohol and sugar in the red wine, I’m going to explore these three variables together.
According to the plots above, the density of the red wine is approximately normally distributed and the distribution of alcohol is right-skewed. Residual sugar has a long tail with distant outliers (the median is 2.2 but the maximum is 15.5), so I’m going to apply a log transformation.
After the log transformation, the distribution of residual sugar looks closer to normal.
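The effect of the log transformation can be illustrated numerically with the ten residual sugar values visible in the `str()` output at the top of the report. This is a rough sketch, not the code behind the plots: the `skewness` helper is a hypothetical moment-based definition written for this example, and the base-10 logarithm is an assumption (the skewness of the logged values does not depend on the base).

```r
# First ten residual.sugar values, copied from the str() output above.
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# Moment-based skewness (hypothetical helper, not from the report):
# third central moment divided by the 1.5 power of the second.
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

skew_raw <- skewness(residual_sugar)
skew_log <- skewness(log10(residual_sugar))

# A positive value means a right tail; a smaller value means less skew.
print(c(raw = skew_raw, log10 = skew_log))
```

Ten rows are far too few to describe the whole dataset, but even here the single large value (6.1) pulls the raw skewness up more than it does after the transformation.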
Next, I will plot three other related variables: free sulfur dioxide, total sulfur dioxide and sulphates.
These three variables are closely related according to the provided text documentation, and the plot above shows that they are all strongly right-skewed. I will apply a log transformation to see whether they look more normally distributed.
After the transformation, total sulfur dioxide and sulphates in particular look much closer to normally distributed.
There are 1,599 observations of red wine, with 13 variables (X, fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol, quality). X and quality are integers and all other variables are floating point numbers.
Other observations:
The red wine quality is approximately normally distributed, and most red wines here have a quality of 5 or 6.
The density difference between wines in our dataset is very small: the minimum density is 0.9901 and the maximum density is 1.0037.
About 75% of red wines have a volatile acidity less than or equal to 0.6400 g / \(dm^3\).
Most red wines have free sulfur dioxide less than 60 mg / \(dm^3\).
The main features in the dataset are the quality of the red wine and the chemical properties such as acidity, sugar, chlorides, sulfur dioxide and alcohol. I’d like to determine which chemical properties influence the quality of red wines.
Volatile acidity, citric acid, total sulfur dioxide, alcohol and some combination of the other variables can be helpful in building a predictive model of the quality of red wine. According to the text documentation provided alongside the red wine data, volatile acidity, citric acid and total sulfur dioxide affect the taste of the wine; hence I think these variables will contribute most to the quality rating of the red wine.
I did not create any new variables. In the future, if I find that transforming or combining the current variables helps in exploring the dataset, I will create some.
The distributions of total sulfur dioxide, free sulfur dioxide, sulphates and residual sugar are right-skewed with some outliers, so I applied a log transformation to them. For residual sugar, I also cropped the data a little bit to get a better look at the majority of the data.
## X fixed.acidity volatile.acidity citric.acid
## X 1.000 -0.268 -0.009 -0.154
## fixed.acidity -0.268 1.000 -0.256 0.672
## volatile.acidity -0.009 -0.256 1.000 -0.552
## citric.acid -0.154 0.672 -0.552 1.000
## residual.sugar -0.031 0.115 0.002 0.144
## chlorides -0.120 0.094 0.061 0.204
## free.sulfur.dioxide 0.090 -0.154 -0.011 -0.061
## total.sulfur.dioxide -0.118 -0.113 0.076 0.036
## density -0.368 0.668 0.022 0.365
## pH 0.136 -0.683 0.235 -0.542
## sulphates -0.125 0.183 -0.261 0.313
## alcohol 0.245 -0.062 -0.202 0.110
## quality 0.066 0.124 -0.391 0.226
## residual.sugar chlorides free.sulfur.dioxide
## X -0.031 -0.120 0.090
## fixed.acidity 0.115 0.094 -0.154
## volatile.acidity 0.002 0.061 -0.011
## citric.acid 0.144 0.204 -0.061
## residual.sugar 1.000 0.056 0.187
## chlorides 0.056 1.000 0.006
## free.sulfur.dioxide 0.187 0.006 1.000
## total.sulfur.dioxide 0.203 0.047 0.668
## density 0.355 0.201 -0.022
## pH -0.086 -0.265 0.070
## sulphates 0.006 0.371 0.052
## alcohol 0.042 -0.221 -0.069
## quality 0.014 -0.129 -0.051
## total.sulfur.dioxide density pH sulphates alcohol
## X -0.118 -0.368 0.136 -0.125 0.245
## fixed.acidity -0.113 0.668 -0.683 0.183 -0.062
## volatile.acidity 0.076 0.022 0.235 -0.261 -0.202
## citric.acid 0.036 0.365 -0.542 0.313 0.110
## residual.sugar 0.203 0.355 -0.086 0.006 0.042
## chlorides 0.047 0.201 -0.265 0.371 -0.221
## free.sulfur.dioxide 0.668 -0.022 0.070 0.052 -0.069
## total.sulfur.dioxide 1.000 0.071 -0.066 0.043 -0.206
## density 0.071 1.000 -0.342 0.149 -0.496
## pH -0.066 -0.342 1.000 -0.197 0.206
## sulphates 0.043 0.149 -0.197 1.000 0.094
## alcohol -0.206 -0.496 0.206 0.094 1.000
## quality -0.185 -0.175 -0.058 0.251 0.476
## quality
## X 0.066
## fixed.acidity 0.124
## volatile.acidity -0.391
## citric.acid 0.226
## residual.sugar 0.014
## chlorides -0.129
## free.sulfur.dioxide -0.051
## total.sulfur.dioxide -0.185
## density -0.175
## pH -0.058
## sulphates 0.251
## alcohol 0.476
## quality 1.000
From the correlation matrix above we can see that the variables most strongly correlated with red wine quality are alcohol (0.476), volatile.acidity (-0.391), sulphates (0.251) and citric.acid (0.226).
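The ranking can be reproduced directly from the quality column of the correlation matrix. This is a small sketch in base R with the values copied by hand from the matrix above; the names `quality_cor`, `ranked` and `top_four` are illustrative, and X is left out because it is only a row identifier.

```r
# Correlation of each variable with quality, copied from the matrix above.
quality_cor <- c(
  fixed.acidity        =  0.124,
  volatile.acidity     = -0.391,
  citric.acid          =  0.226,
  residual.sugar       =  0.014,
  chlorides            = -0.129,
  free.sulfur.dioxide  = -0.051,
  total.sulfur.dioxide = -0.185,
  density              = -0.175,
  pH                   = -0.058,
  sulphates            =  0.251,
  alcohol              =  0.476
)

# Order by absolute strength, strongest first, so that strong negative
# correlations rank alongside strong positive ones.
ranked <- quality_cor[order(abs(quality_cor), decreasing = TRUE)]
print(ranked)

# Names of the four variables most strongly correlated with quality.
top_four <- names(ranked)[1:4]
print(top_four)
```

Sorting on the absolute value matters here: sorting on the signed value would push volatile.acidity to the bottom even though it is the second strongest relationship.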
Let’s use boxplots to take a closer look at these four variables that are correlated with the quality of red wines.
First, let’s plot quality with alcohol:
It seems that, from quality 5 upward, the more alcohol a red wine contains, the higher its quality rating.
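The centre line of each box in such a plot is the median alcohol level for that quality rating, which can also be computed directly. The sketch below uses only the first ten rows visible in the `str()` output, so it is an illustration of the calculation and not a result for the full dataset; `wine_head` and `median_alcohol` are hypothetical names.

```r
# First ten rows of alcohol and quality, copied from the str() output above.
wine_head <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Median alcohol per quality level (the centre line of each box).
median_alcohol <- tapply(wine_head$alcohol, wine_head$quality, median)
print(median_alcohol)

# On a full data frame, the boxplots themselves could be drawn with
# base graphics, for example:
# boxplot(alcohol ~ quality, data = wine_head)
```

With only ten wines, quality 6 is represented by a single row, so the medians here should be read as a demonstration of the grouping step only.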
Secondly, let’s plot quality with volatile.acidity:
There is a very obvious pattern: the more volatile acid a red wine contains, the lower its quality. This agrees with the text documentation provided with the red wine data, ‘volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste’.
Thirdly, let’s take a look at the quality with sulphates:
According to the plot above, the higher the sulphates, the better the quality of the red wine, but the effect of sulphates appears weaker than that of the other two variables, alcohol and volatile acidity.
Finally, I’m going to plot quality with citric.acid:
According to the author’s documentation provided alongside the red wine data, ‘Citric acid can add “freshness” and flavor to wines’. Our plot is consistent with this: it shows that quality and citric acid are positively correlated.
Besides the four most correlated variables above, I also noticed that density and quality have a correlation of -0.175. This correlation is not as strong as those of the four variables that I analyzed above, but it is still fairly strong compared with the remaining variables. I suspect that part of this is because density has a -0.496 correlation with alcohol, and alcohol is one of the variables most strongly correlated with quality.
This makes sense: alcohol is positively related to quality and density is negatively related to alcohol, which is why we see a negative correlation between density and quality.
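One way to probe this suspicion is a first-order partial correlation, which estimates how density and quality are related once the linear effect of alcohol is removed from both. This step is not part of the original analysis; it is a sketch that applies the standard partial correlation formula to the three pairwise values copied from the correlation matrix above, with illustrative variable names.

```r
# Pairwise correlations copied from the correlation matrix above.
r_density_quality <- -0.175
r_density_alcohol <- -0.496
r_alcohol_quality <-  0.476

# First-order partial correlation of density and quality,
# controlling for alcohol.
partial_density_quality <-
  (r_density_quality - r_density_alcohol * r_alcohol_quality) /
  sqrt((1 - r_density_alcohol^2) * (1 - r_alcohol_quality^2))

print(round(partial_density_quality, 3))
```

The result comes out slightly positive (about 0.08) instead of negative, which is in line with the idea that the negative density and quality correlation is largely carried by alcohol. It relies on rounded inputs and on linear relationships, so it should be read as a rough indication only.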
According to the plots and data above, the most decisive features for the quality of red wines are alcohol (0.476), volatile.acidity (-0.391), sulphates (0.251) and citric.acid (0.226). The amounts of alcohol, sulphates and citric acid are positively related to the quality of the wine, and the other decisive variable, volatile acidity, is negatively related to it.
I noticed that density and quality have a correlation of -0.175. I suspect that this is partly because density has a -0.496 correlation with alcohol, and alcohol is one of the variables most strongly and positively correlated with quality.
The strongest relationship I found is alcohol and quality, which are positively correlated.
I made 6 plots of the 4 variables that are most correlated with quality. In these plots, I used different colors to represent different quality levels. We can see that most high quality red wines have a high alcohol level and low volatile acidity. High quality red wines also have relatively high sulphates and citric acid.
The four most decisive variables (alcohol, volatile.acidity, sulphates and citric.acid) reinforce each other. All 6 plots seem to show that high quality wines tend to have high levels of alcohol, citric.acid and sulphates, and low levels of volatile.acidity.
According to the text documentation by the author, ‘volatile acidity at too high of levels can lead to an unpleasant, vinegar taste’. I noticed that red wines with volatile acidity higher than 1.1 almost never get a high quality rating.
## 3 4 5 6 7 8
## 10 53 681 638 199 18
I chose the quality plot as my first plot because quality is the feature I care about most in this research. According to the plot, most red wines have a quality of 5 or 6; few get a quality rating of 4, and even fewer get 3 or 8, which are the two extremes of red wine quality in this dataset. I also noticed that even though the quality rating ranges from 0 to 10 according to the author’s documentation, no red wine receives a rating lower than 3 or higher than 8. This may be due to how the quality rating is produced. According to the information provided by the author, at least 3 wine experts rated the quality of each wine. I guess the final result is based on the mean or the median of the experts’ ratings; since it’s unlikely that all of them would give a rating of 0 or a full rating of 10, it makes sense that extreme ratings like 0 and 10 don’t appear.
For plot two, I chose the boxplots of quality against four variables: alcohol, volatile acidity, sulphates and citric acid, because these are the four most influential variables for the quality of red wines. Firstly, wines with a higher alcohol level tend to have higher quality. Secondly, quality and volatile acidity are negatively correlated; however, there is no significant difference in volatile acidity between quality 7 and quality 8. Thirdly, quality and sulphates are positively correlated, but the correlation is not as strong as the correlation between quality and alcohol. Finally, there is an obvious pattern that red wines with higher quality have a higher citric acid level.
I chose this plot because it shows quality and the top two influential variables together in one plot. We can see that high quality red wines tend to have a high alcohol level, especially above 12% of the red wine volume. It also shows that high volatile acidity tends to prevent red wines from receiving a high quality rating.
In this research I explored a dataset containing 1,599 red wines with 13 variables, 11 variables on the chemical properties of the wine, as well as the quality rating by experts and the identifier variable X. Through the exploration of one variable, two variables and three variables, I found the four variables that influence the quality of red wine most: alcohol, volatile acidity, sulphates and citric acid. The plots and the correlation matrix show that alcohol, sulphates and citric acid are positively correlated with the quality of red wine while volatile acidity is negatively related with the quality.
I think everything went smoothly except for the following two things:
The correlations between quality and the other variables are not as strong as I expected. In fact, the strongest correlation with quality is 0.476, for alcohol, although some correlations are stronger than others.
Although three or more experts have rated each red wine, we only get one number for each red wine in the dataset, and this prevents us from seeing a more general picture of the distribution of the red wine quality. I think if we had more data about the red wine quality evaluation, we could do a more detailed analysis.